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Abstract 

In this paper we investigate performance of global communica- 
tions in a particular parallel code. The code simulates dynamics of 
expansion of premixed spherical flames using an asymptotic model 
of Sivashinsky type and a spectral numerical algorithm. As a result, 
the code heavily relies on global all-to-all interprocessor communi- 
cations implementing transposition of the distributed data array in 
which numerical solution to the problem is stored. This global data 
interdependence makes interprocessor connectivity of the HPC system 
as important as the floating-point power of the processors of which the 
system is built. Our experiments show that efficient numerical simu- 
lation of this particular model, with global data interdependence, on 
modern HPC systems is possible. Prospects of performance of more 
sophisticated models of flame dynamics are analysed as well. 
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1 Introduction 

The problem under consideration stems from the studies of dynamics of pre- 
mixed flames affected by the hydrodynamic flame instability. These flames 
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are common in science and engineering applications, in particular in working 
out industrial safety requirements in petrochemical plants, in design of in- 
ternal combustion engines, and in astrophysics. Expanding spherical flames 
are one of the most important types of premixed flames and are a popular 
subject of fundamental and applied studies. In particular, a possibility of a 
self-induced transition from deflagration to detonation is the issue of a great 
importance both in practice and theory. 

One of the most distinctive features of these flames is that their expan- 
sion rate does not remain constant, as one would expect, but it grows in 
proportion to the square root of time pQ. Theoretical evaluations indicate 
that the expansion rate should depend on physical dimension of the problem 
and numerical simulations confirmed this for planar circular flames, see e.g. 
[2] and references therein. Calculations of spherical flames carried out so far, 
using an asymptotic Sivashinsky type model, remain inconclusive, because 
it was impossible to cover long enough time intervals to allow flames to sta- 
bilize to the asymptotic power-law regime [3j. The direct three-dimensional 
Navier-Stokes simulations do not look feasible at present at all, because long 
time intervals mean large flame radii, i.e. large Reynolds numbers. 

Mathematical model used in our simulations needs intensive all-to-all data 
exchanges and, as a result, the scalability of our MPI code is expected to 
degrade as the number of processors brought into play grows. Extrapolation 
of the code performance has indicated that the computer resources needed 
to carry out the calculations until stabilization of flames to the asymptotic 
power-law regime may require up to a few processor-years. On the other 
hand, reaching this asymptotic power-law regime of expansion of spherical 
flames numerically is crucial in order to justify studies of pseudoresonant 
interaction of flame wrinkles with the upstream velocity perturbations [I] as 
a possible mechanism of triggering transition of deflagration to detonation. 

Modern massively parallel computers differ from each other by design 
substantially. Accordingly, performances of their inter-processor communi- 
cations differ as well. Hence, some HPC systems and algorithms of global 
interprocessor communications might be more beneficial for one particular 
type of a parallel code than for another. In this paper we describe results of 
porting our Fortran-90/MPI code on SGI Altix 3700, NEC SX-8, and Cray 
XT4 and analyze its comparative performance for a variety of programming 
implementations of global communications. 

In the following Section [2] we outline the governing equation and numer- 
ical algorithm to solve it. Typical results of numerical experiments with 
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expanding flames are illustrated in this section too. In Section [3] we describe 
the structure of our computational code. Results of the benchmark runs are 
presented in Section [U 

2 Mathematical Model and Numerical Algo- 
rithm 

Here we study computational performance of a parallel code for numeri- 
cal simulation of dynamics of three-dimensional flames using the asymptotic 
model suggested in [3] , where a review of other available models can be found 
too. 

Let us consider an expanding flame front and assume that its surface 
is close enough to a sphere and that every point on the flame surface is 
uniquely defined by its distance r = r(9, <p, t) from the origin for < 9 < tt, 
< cf) < 2tt, and t > 0. It is convenient to represent such a flame as 
a perturbation &(9,<f), t) of a spherical surface of a reference radius r (t), 
i.e. r(9,(p,t) = r (t) + $(#,</>,£). Denoting Fourier components of $(6*,0, t) 
as <3?fc, the Fourier image of the equation governing evolution of the three- 
dimensional segments of such flames can be written as 



where k = (ki,k 2 ), I = (h,h) are pairs of integers such that — oo < A^Zj < 
oo, i = 1,2, which can be denoted also as k,l G Z 2 , fk(t) are the Fourier 
components of the properly scaled upstream perturbations of the unburnt gas 
velocity field f((p,t), 7=1 — pb/p u is the contrast of densities p&, p u of the 
burnt and unburnt gases respectively, and initial values of $fc(to) — are 
given. By construction, ([T]) governs dynamics of equatorial flame segments 
—n/(29 w ) <9< n/(29 w ), < < 27r/0 7r , where 9 n , 4> n are integers, see [3]. 

For the unity Lewis number the length and time scales used in ([1]) are the 
thermal flame front width 5 t h = D th /ub and r y~ 2 S t h/ui > respectively, where 
D t h is the thermal diffusivity of the system and u\, is the planar flame speed 
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relative to the burnt gases. Choice of r (t) in the model just introduced may 
be based on a variety of principles, here we use r (t) = t. 

System ([1]) is solved numerically by neglecting the harmonics of orders 
higher than a finite integer number K > 0. Then, the nonlinearity can be 
represented as a circular convolution and evaluated effectively with the FFT. 
The structure of the two-dimensional FFT, as the tensor product of the one- 
dimensional ones, allows for efficient parallelization of the computational 
algorithm. Further details of the numerical integration algorithm can be 
found in [21 [3]. 

In spite of drastic simplifications made during evaluation of the three- 
dimensional asymptotic model ([T]), its numerical simulations are still ex- 
tremely computationally intensive and computational resources available to 
us allowed to consider only a few cases so far. In the case presented be- 
low, the initial perturbations of the spherical flame surface were random 
$(0,0, to) = e?7(6*, <^)- Here \r](8,<f))\ < 1 is a number chosen randomly for 
every required combination of 9 and 0, e = 10~ 3 and t = 200. The explicit 
forcing was not applied in these examples, though the round-off errors were 
playing its role implicitly. Figured] shows evolution of half of a spherical flame 
in the top and of its central region in the bottom. Cells in the flame regions 
near the poles appear severely deformed, because of a simplified treatment 
of the spherical geometry in ([1]), and should be disregarded. 

Averaged velocities of the spherical flames < $ 4 > and their power law 
approximations (t — t*) a are depicted in Fig. [2] as well. As time goes by, the 
expansion rate of the two-dimensional circular flame stabilizes to <$t>oc 
(t-t*) 1/4 , i.e. to the power law predicted by the fractal analysis. The three- 
dimensional flames accelerate much faster than the two-dimensional ones. 
The power law approximation of the expansion rate on the time interval 
considered so far gives a ~ 0.76, indicating good chances that the expansion 
rate will eventually stabilize to <$i>oc (t — t*) 1 ^ 2 after some transitional 
period as demonstrated by the experiment and predicted by fractal analysis. 

Further analysis of results of numerical simulations can be found in [21 [3] . 

3 The structure of the computational code 

The main computational array of the problem is &k lt k 2 , where —K < k\, < 
K and K is a large positive integer. In order to tackle the nonlinear term in 
(CQ) as a circular convolution, the array ^k ly k 2 is augmented to ^k lt k 2 ^ where 
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Figure 1: Evolution of half of a spherical flame (top row) and of its centrally 
located region (bottom row). The view in the latter sequence is limited by a 
cone with fixed apex angle in the origin. Here 7 = 0.8 and /(</>, t) = 0. 
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Figure 2: Averaged flame front velocities and their power law approximations 
(t — t*) a for circular (left) and spherical (right) flames versus time t (right). 
Here 7 = 0.8 and f((j),t) = 0. 



— 2K < k\, k 2 < 2K + k , see e.g. [5]. The size of the augmented array is 
(AK + k + 1) x (AK + ko + 1) and integer k is selected as minimal as possible 
to make value of = 4K + k + 1 efficient for the FFT. 

Initially, the augmented array ^k t ,k 2 is distributed between N p processors 
column-wise, so that its first K p = K^/N p columns are stored in the first 
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processor, the next K p columns are in the second processor and so on. At 
this stage, the one-dimensional FFT can be effectively applied along the first 
index of the arrays fci^j^,^ and A^fci,^) i- e - column-wise. In the next stage, 
the resulting global arrays should be transposed in order to accomplish the 
two-dimensional FFT by applying the one- dimensional FFT along the second 
index, i.e. raw-wise. 

Upon completion of the two-dimensional FFT, the convolution is calcu- 
lated as the element-by-element product of two FFT'ed arrays and the just 
describer set of operations is repeated in the backward direction. First, the 
one-dimensional inverse FFT is applied along the second index of the Fourier 
image of the convolution. Then, the global array is transposed back to its 
column-wise distribution and the one-dimensional inverse FFT is applied 
along the first index. 

The overall structure of the code is illustrated in the block diagram in 
Fig. [31 Interprocessor communications intensive parts of the code correspond 
to rectangles with smoothed vertices and floating-point arithmetics intensive 
modules are denoted by standard rectangles. Computational resource con- 
sumption in blocks not shown in the diagram is not essential. The code was 
written in Fortran-90 and MPI. Vendor recommended FFT routines were 
used on all tested HPC systems. 

The problem of transposition of data arrays distributed among a num- 
ber of processors is quite common in scientific and engineering applications 
of HPC. A number of strategies and algorithms were suggested for various 
computer architectures and data structures, see e.g. [6], [T] . Here, we are con- 
sidering the simplest column- and raw-wise distributions of two-dimensional 
arrays only and study their transpositions by straightforward MPI tools on 
a few popular HPC systems. 

The straightforward implementation of transposition of distributed data 
arrays can be carried out using the MPI_ALLTOALL routine. Before the call to 
the MPI.ALLTOALL, local arrays should be reshaped, because the MPI_ALLTOALL 
routine requires the data destined for a particular processor to be stored in 
RAM continuously. After the call, the newly assembled local arrays should 
be reshaped back, now in the raw-wise manner in terms of the reference 
distribution of the global array, in order to apply the one-dimensional FFT 
along the second index. The entire procedure trebles the required memory, 
but RAM is usually less critical in the problem in question than the interpro- 
cessor communications efficiency This transposition procedure is illustrated 
in Fig. H] and the code itself is given in the Appendix. 
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Figure 3: The block-scheme of the code. Red blocks are communication and 
cyan ones are floating-point operations intensive. 



The MPI_GATHERV routine can be called by every processor in order to 
fetch missing (N p — l)K% elements of ^k ly k 2 an d form its raw-wise repre- 
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Figure 4: The global data array transposition procedure. The illustration 
features a 6 x 6 array distributed over 3 processors Pi, P2, and P3. 



sentation as an alternative to the use of the MPI_ALLTOALL routine in the 
implementation of the global array transposition. This approach avoids re- 
shaping of local arrays and is RAM economical. However, in our experiments 
all tested HPC systems performed faster with the transpositions based on 
the MPI_ALLTOALL routine. As RAM is usually less critical to the problem 
in question than the interprocessor communications efficiency, the latter ap- 
proach appears inferior to the former one on HPCs with sufficient memory 
resources. Accordingly, in what follows we present results for the algorithm 
based on the MPI_ALLTOALL routine only. 

The transposition of the global array requires in our case cross-processor 
transmission of N comm oc — N' 1 ) elements of ^ki,k 2 an d is the most 

communication resource-demanding segment of the code. Combined use of 
the FFT routine requires Nf p = 0{K\ log 2 K^) floating-point operations 
per time step making it the most critical arithmetic resource consumer. 
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Thus, the ratio of numbers of floating point to communication operations 
is N fp /N comm = 0(log 2 -KW/(1 - Np 1 )), i.e. practically N fp /N comm « 1 
for N p ^> 1. In other words, speed of communications is as important as 
the speed of arithmetic operations. This is in startling contrast with ex- 
plicit approximations of local PDE's like Navier-Stokes or Maxwell systems. 
In the latter case explicit approximations of local derivatives would require 
communications just between neighbouring processors involving grid nodes 
from close proximities of their common boundaries only. Hence, the numbers 
of communication and arithmetic operations will be N comm = 0(K^,N p ) and 
Nf p = O(Ky), resulting in Nf p /N comm = 0(K^/N p ) and meaning that speed 
of interprocessor communication is not that important as speed of arithmetic 
operations for N p <C Ky. 

Another interprocessor communications intensive routine of our code is 
the symmetrization of the global solution array in order to ensure it remains 
real valued in physical space. Programmatically this is carried out by map- 
ping the upper half of the global array onto the lower one in accordance with 
the symmetry relationship \I/_ fcli _& 2 = ^ki,k 2 i see Fig- El The routine trans- 
mits roughly half of the array *&ki,k 2 an d is called once per time step, which 
makes it less resource demanding than the transposition one. It is built of 
the non-blocking point-to-point communications subroutines MPI_ISEND and 
MPI_IRECV. Similarly, the numerical integration to advance the solution in 
time needs some floating-point arithmetics, but its resource requirements are 
inferior to the FFT's. 

The routine, which calculates a variety of averages of obtained numerical 
solution does both interprocessor communications, by using MPI_ALLREDUCE, 
and floating-point arithmetics. Again, its resource demands are inferior to 
the crucial blocks identified above. In addition, calls to this routine are not 
compulsory at every time step. 

4 Comparative performance of the code 

The Fortran-90/MPI code underwent some basic optimization during porting 
on a particular HPC system. Such basically optimized clones of the code were 
run on a set of HPCs with varying numbers of processors and solution array 
sizes. A few runs of the original non-optimized code have been carried out too 
in order to illustrate the value of adjustments to particular HPC architecture. 
The NEC SX-8 required the largest optimization effort, though the pay off 
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Figure 5: The global data array symmetrization procedure. Data from the 
upper-right corner are spread to the rest of the array according to the real 
symmetry relation *_ fel _ fa = ^ kl ,k 2 - 



was the most ample too. 




Figure 6: Comparison of the wall clock time t^ p needed to complete the task 
(left) and their ratios to the wall clock time for NEC SX-8 (right) versus 
number of processors N p . Tag ini corresponds to an initial non-optimized 
code, its absence - to the optimized ones. 

Graphs in Fig. [6] show that the code in question runs on the SX-8 on 
average about three times faster than on Altix and about twice faster than 
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on XT4. Further, they reveal that the optimized code is perfectly scalable 
in spite of its global all-to-all communications. Tests were carried out with 
ranging from 2000 to 32000 and the graphs correspond to Kq, « 8000. 
Variations of in the given limits does not affect behaviour of graphs in 
relation to each other significantly. 




Figure 7: Comparison of the speed up factor t Np=4: /t Np versus number of 
processors N p . Tag ini corresponds to an initial non-optimized code, its 
absence - to the optimized ones. 

Our calculations on Altix were practically limited by 48 processors and by 
32 processors on SX-8. However, it was possible to use up to 1024 processors 
of XT4. Corresponding graph is depicted in Fig. [7J It reveals, that scalability 
of the code starts to degrade from N p w 128. However, even for N p = 1024 
the code manages to utilize at least 40% of the added computing power. 

Profiling shows that further improvement of the code performance on 
all tested systems is impeded by the MPI_ALLT0ALL routine and to a lesser 
degree by the FFT procedures. Results of comparison of performances of 
our particular code on a set of HPCs should not be interpreted as the overall 
superiority of one of those HPCs over others. 

Being computationally efficient and physically plausible, the Sivashinsky 
type models are not free from drawbacks. For example, they neglect certain 
types of low order spherical harmonics and are not valid for the entire ex- 
panding flame front, see e.g. [3]. Global geometrically correct asymptotic 
models were suggested as well, see e.g. In planar geometry the governing 
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equation of such a model is as follows 



dr 
~dt 



l +eK ( r> t) + 2 i + 




n(r,t). (2) 



Here e and 7 are physical constants, k is the curvature and dl$ is the increment 
of length of contour C. Other notations are explained in Fig. [HJ 



Figure 8: Geometry of the planar flame front C propagating outwards, n is 
the inner unit normal. 

Unlike the Sivashinsky models, the right-hand side of the latter ones 
cannot be represented as a tensor product of one-dimensional operators pre- 
cluding use of efficient column/row-wise data structure and causing further 
problems in efficient parallelization. However, its singular integral in the 
right-hand side can be treated with the fast multipole techniques [9] reduc- 
ing the number of required floating point operations per time step to roughly 
Nf p = 0(K 2 ). The number of required communications per time step will 
be in the range 0(K 2 ) < N comm < 0(N p K 2 ), meaning that the efficiency of 
computations with this model will be defined by the quality of programming 
and acceptable compromise with accuracy. 

On the other hand, explicit economical discretization of the 3-D Navier- 
Stokes system would lead to Nf p = 0(K 3 ), and approximately the same 
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number of communications. It should be noticed however, that approxi- 
mation of the Navier-Stokes system in the radial direction would need much 
larger number of grid nodes than in the azimuthal ones, K, in order to resolve 
the structure of the flame front. Furthermore, the Navier-Stokes approach 
would need to operate with a large set of physical parameters, e.g. concentra- 
tions of reacting chemical species. Thus, it appears that for practical values 
of K, arithmetic and communication demands of the Navier-Stokes models 
will be greater than of model [8]. Although, complexity of approximation of 
the singular integral and the asymptotic nature of model [8] may annul the 
benefit of reduction of the problem dimension by one. 

5 Conclusions 

Global data array transposition is the most resource consuming operation 
in simulations in question. Developed MPI code based on a Sivashinsky 
type model and MPI_ALLTOALL routine for transposition of column-wise dis- 
tributed data arrays is perfectly scalable for up to a hundred of processors. 
On Cray XT4 increase of N p from 128 to 1024 utilizes at least 40% of added 
computational power. Performance of the NEC SX-8 both as a massively 
parallel system and as a "number cruncher" is superior to other tested sys- 
tems. Achieving reasonable code efficiency requires intensive optimization 
on all considered systems. 

Extrapolation of the Cray XT4 and NEC SX-8 performance shows that 
with their help reaching the stabilized acceleration regime of the expand- 
ing three-dimensional spherical flame numerically is feasible. Studies of the 
auto-transition of such flames to detonation on this computer are realistic as 
well. Prospects of use of massively parallel systems for more sophisticated 
asymptotic models of flame dynamics are less encouraging, but not hopeless. 

Comparison of performances of computers was carried out on a particular 
code only and should not be interpreted as the overall superiority of one HPC 
system over another. 
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Appendix. The Fortran 90/MPI distributed 
array transposition routine 



MODULE sphe_d2_mpi_mach 

INTEGER, PARAMETER : : ikd=KIND(l) ! main integers 

INTEGER, PARAMETER : : rkd=SELECTED_REAL_KIND (P= 1 5 ) ! main reals 
INTEGER, PARAMETER : : ikdO=KIND(l) ! MPI integers 

END MODULE sphe_d2_mpi_mach 

SUBROUTINE Transp2 (u , v , nproc , Kex , nconv , nconv_mpi , ut , vt ) 
! Row-to-column transposition of [-Kex:nconv-Kex-l]x[-Kex:nconv-Kex-l] 
! distributed in chanks of [-Kex:nconv-Kex-l]x[l :nconv_mpi] arrays 
! over nproc processors 

! u,v - initial and transposed arrays correspondingly 

! nproc - number of processors 

! Kex=2*K 

! nconv=4*K+kO+l 

! nconv_mpi=nconv/nproc 

! ut,vt - internal work arrays 

USE sphe_d2_mpi_mach 
USE MPI 
IMPLICIT NONE 

INTEGER (KIND=ikdO) : : ier.mpi 

INTEGER (KIND=ikd) :: j , 1 , nproc , Kex, nconv, nconv_mpi 
REAL(KIND=rkd) :: u(-Kex:nconv-Kex-l , 1 :nconv_mpi) 
REAL(KIND=rkd) :: v(-Kex : nconv-Kex-1 , 1 : nconv.mpi) 
REAL(KIND=rkd) :: ut(l :nconv_mpi*nconv_mpi, 1 : nproc) 
REAL(KIND=rkd) :: vt(l :nconv_mpi*nconv_mpi, 1 : nproc) 
! 

! Executables 

IF(nproc==l)THEN 

DO j=-Kex,nconv-Kex-l 
v(: , j+Kex+l)=u(j , :) 

END DO 
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ELSE 
DO l=l,nproc 

DO j=l ,nconv_mpi 

ut ( ( j-1) *nconv_mpi+l : j*nconv_mpi ,1) & 
& =u ( (1-1) *nconv_mpi-Kex : l*nconv_mpi-Kex-l , j ) 
END DO 
END DO 

CALL MPI.ALLTOALL (ut , nconv_mpi*nconv_mpi , MPI_DOUBLE_PRECISION , & 
& vt,nconv_mpi*nconv_mpi,MPI_DOUBLE_PRECISION, & 

& MPI_COMM_ WORLD, ier_mpi) 

DO j=l ,nconv_mpi 
DO l=l,nproc 

v ( (1-1) *nconv_mpi-Kex : l*nconv_mpi-Kex-l , j ) & 
& =vt(j : (nconv_mpi-l) *nconv_mpi+j :nconv_mpi , 1) 
END DO 
END DO 
END IF 
RETURN 

END SUBROUTINE Transp2 
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